By Dimitre Novatchev |
Language | XSLT |
Category | xml, designpatterns, msxml |
Posted | 10 Mar 2001 |
Updated | 21 Mar 2001 |
Summary |
In my last, Binary Search snippet I mentioned that the algorithm has many and very important applications, and hinted I was going to use it in sorting. This is precisely what I'm going to do now. |
Here's how our web-administrator Trace Wilson formulated the problem: |
"Hi Daryl, |
I've had the same problem for quite a while. |
What I do is first is find all items before a date, then I sort on my date |
(see snippet http://www.vbxml.com/snippetcentral/main.asp?view=viewsnippet&id=v20010125215939\) |
There are another 2 there on dates. |
From there I need to check if the position() is less than x, which means that my whole selection gets iterated through. If you can think of a nicer solution, I'd love to hear your ideas. I asked earlier on this and Dimitre mentioned that I really had no choice in the matter. But he has just supplied a solution for breaking out of a for-each which I'm going to try next week.". |
What I was offering to Trace then was to stop performing full sort (with complexity of Log2(N)*N and to find k (k=5 in the real case -- anyway, something much smaller than N) times the maximum dates instead. The latter has a complexity of O(N) -- this means that the complexity is some linear function of N -- c*N. |
In this case it is obvious that c=k. |
While this solution is clearly much better, there's still the possibility that another O(N) solution might exist with complexity c2*N, where c2 is signifficantly smaller than c. |
Interesting, isn't it? :o)) |
OK, let's analyse the problem. |
When finding the maximum element of a node-set we have a single work pad where to keep the current maximum value. We scan through the set one element at a time and compare it to the value in the workpad. Whenever the current element is greater, we produce a new workpad and put the new number there (for reasons of simplicity I'll speak about "numbers", not "elements", but we'll know that this is "the number, corresponding to the element"). |
The algorithm ends when we have performed a full scan of the set of elements. If the set has N elements, there will be N comparisons. To perform this algorithm k times will require k*N comparisons. |
How can we find the k biggest elements even faster? |
Hmmm... if only we could scan the set only once... Wait, that's an interesting idea, why not? |
Imagine that we have a bigger workpad that can hold k numbers. At any step through the elements of the set this workpad will hold the k biggest numbers up to now. If the current element is bigger than at least one in the workpad, then we construct a new workpad, in which the smallest (up to now) number has been dropped and the current number has been added. |
In this way we will perform only one pass over the set of elements and at the end the workpad will contain the k biggest elements. |
Wow... isnt that nice! |
Yes, but... does this really have smaller complexity? We scan only N elements but every element's number has to be compared now with the k numbers in the workpad -- this still means k*N comparisons... :((( |
OK, now we're ready for the really big step forward -- what if we keep the numbers in the workpad always sorted? Will we then really require k comparisons? |
Wait... this reminds me of something... didn't I write another snippet just the day before and said it could be used for very efficient sorting? Binary Search...? |
Here's the idea: We modify slightly the Binary Search snippet so that it will always return the position a number must have in the sorted workpad if we want to add it there and still keep the workpad sorted. |
What will be the complexity then? A binary search requires just Log2(k) comparisons -- hence the total complexity will be Log2(k)*N. |
So, compared to the "k-times the maximum" this algorithm will be k div Log2(k) times faster. |
In practical terms, if we're getting 5 out of N, this will be slightly more than twice faster, if we're getting 10 out of N this will be more than 3 times faster, if we're getting 100 of N (think N is thousands) this will be almost 7 times faster. |
Notice also, that if we make the workpad hold N elements, then we have (our own!) super fast sorting algorithm. The complexity is the same as that of "xsl:sort", but we now have huge flexibility to define and use in the sort our own comparison operators for any (including our own!) data types. Good result, isnt it? |
Is this the fastest sort possible? Well, I know of at least one search algorithm, which is faster than binary search. May give it a try... :o)) |
Now, how we code this in XSLT? |
Let's have the following source xml document: |
|
I preferred to represent the workpad as a string containing 5 8-digit numers, each preceded by a delimiter. Due to lack of fantasy I chose the space as delimiter. Why use a delimiter? |
Just try to debug using these five 8-digit numbers pressed together and you'll know the answer. |
So, to create a new workpad from the current one will require copying everything (old first number excluded!) from the second number to the merge position (a good possibility to use the substring() function), copying the new number, then copying everything from the old workpad that follows the merge position. |
Special care was taken to efficiency -- a most important example is how the old workpad is reused if the current number doesn't make it there. This will save a tremendous amount of memory in all but the trivial cases. |
Also, the modifications to the Binary Search are quite small -- just note, that the merge position will be calculated differently, depending on whether the last comparison was ">" or "<". |
When we apply the stylesheet below to the source xml document, we'll get the following correct result: |
"Result: 19990417 20001123 20010406 20020417 20100714" |
And here's the complete stylesheet: |
|